Accurately predicting home prices is essential for stakeholders such as buyers, investors, and policymakers, because it informs major financial decisions, urban planning, and policy initiatives. However, home price prediction remains a difficult task, as prices are influenced by a wide range of factors, including housing characteristics, neighborhood amenities, and overall neighborhood economic vibrancy. In this analysis, I build a predictive model for home prices in Charlotte, NC, using a dataset of housing information for Mecklenburg County, NC. The model draws on both internal and external (spatial) factors to predict housing prices. The internal factors include basic housing characteristics, such as square footage and number of bedrooms. The external factors include the crime rate, median household income, public transportation coverage, and access to parks and public spaces. By identifying and engineering such factors, the predictive model aims to minimize prediction error and provide deeper insight into the Charlotte, NC housing market.
My analytical approach starts with loading and cleaning the provided housing dataset. I first explored the data and identified missing values and potential outliers. Then, through feature engineering, I created new features from the existing ones to improve model performance, including home age and dummy variables for the air conditioning and heating systems. Additionally, spatial data and features were gathered from the American Community Survey and open data portals, including crime data, proximity to recreational spaces, and tree canopy. Following data preparation, I partitioned the data into training and testing subsets using an 80-20 split to reliably assess the model's predictive performance. The model was developed using Ordinary Least Squares (OLS) regression and tested across various combinations of variables. The final model was evaluated using diagnostic metrics (MAE, RMSE, R-squared) and spatial residual analyses, including Moran's I test, to detect spatial autocorrelation in residuals. The analysis concludes by addressing model limitations and potential improvements.
The first step of this analysis is loading the housing data for the Charlotte, NC area. The data is in GeoJSON format, with over 46,000 records of housing information for Mecklenburg County, NC.
The original dataset contains many fields that are not closely related to housing prices, such as zip code, owner's name, and tax code. I filtered the dataset to keep only the price, bedrooms, yearbuilt, fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade, units, heatedfuel, actype, and extwall columns. The data is then cleaned by removing missing values and houses priced at zero or below, since such prices are not useful for prediction.
house <- house %>%
dplyr::select(price, bedrooms, yearbuilt, fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade, units, heatedfuel, actype) %>%
filter(!is.na(price) & !is.na(bedrooms) & !is.na(yearbuilt) & !is.na(fullbaths) & !is.na(halfbaths) & !is.na(heatedarea) & !is.na(storyheigh) & !is.na(numfirepla) & !is.na(bldggrade) & !is.na(units) & !is.na(heatedfuel) & !is.na(actype)) %>%
filter(price > 0)

After checking housing prices, a small number of homes in Charlotte have exceptionally high values, representing clear outliers. These extreme values could significantly distort the model's ability to accurately predict typical housing prices. Therefore, I excluded all homes valued above 10 million dollars to ensure the robustness and accuracy of the predictive model.
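The exclusion described above can be sketched as a simple filter (assuming the cleaned `house` data frame from the previous step):

```r
library(dplyr)

# Drop extreme outliers: homes valued above $10 million
house <- house %>%
  filter(price <= 10000000)
```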
For prediction purposes, the distribution of the dependent variable matters because it helps ensure the accuracy and reliability of the model. When the variable is approximately normally distributed, the errors (residuals) tend to be normally distributed as well.
The histogram of the price shows that the price is right skewed. To make the price more normally distributed, I took the log transformation of the price. The histogram of the log price shows that the log price is more normally distributed than the original price.
ggplot(house, aes(x = price)) +
geom_histogram(fill = "lightblue", bins = 500) +
labs(title = "Histogram of Price", x = "Price", y = "Frequency")

house <- house %>%
mutate(logprice = log(price))

ggplot(house, aes(x = logprice)) +
geom_histogram(fill = "lightblue", bins = 500) +
labs(title = "Histogram of Log Price", x = "Log Price", y = "Frequency")

Feature engineering is the process of creating new features from existing ones to improve model performance. In this analysis, I created the age of the house from its year built, dummy variables for the air conditioning and heating systems, and the engineered categorical variables mansion and castle. The air conditioning and heating dummies record whether the house has each system, as houses with air conditioning and heating tend to have higher prices. The mansion dummy flags houses with 5,000 to 10,000 square feet of heated area, three to five bedrooms, and more than two full baths. The story height is extracted from the storyheigh column by removing non-numeric characters. I also converted the heatedfuel column into a new categorical variable, heatfuel, replacing the "NONE" value with NA.
The building grade is converted to a numeric variable by assigning an ordinal value to each category, based on the grade definitions below.
| Grade | Description | Typical Examples | Quality & Value Level |
|---|---|---|---|
| Minimum | Very basic, minimal standards, limited durability. | Old sheds, simple outbuildings, deteriorating structures | Lowest |
| Fair | Below-average condition, noticeable wear or deferred maintenance. | Older homes/apartments with deferred maintenance, structures needing renovation | Below Average |
| Average | Standard construction quality, functional and maintained condition. | Typical single-family homes, standard apartment units | Moderate |
| Good | Above-average quality construction, good condition. | Newer suburban homes, renovated units, higher-quality commercial buildings | Above Average |
| Excellent | Superior quality, exceptional finishes, and meticulous attention to detail. | High-end residences, luxury condominiums, upscale commercial properties | High |
| Custom | Unique or specialized construction, highly personalized design, typically luxury or architecturally significant. | Luxury custom homes, unique historical restorations, architect-designed high-end properties | Highest or Specialized |
Source: https://localdocs.charlotte.edu/Tax_Collections/Reports_Studies/
house <- house %>%
mutate(
age = 2022 - yearbuilt,
storyheigh = as.numeric(gsub("[^0-9.]", "", storyheigh)),
bldggrade = case_when(
bldggrade == "MINIMUM" ~ 1,
bldggrade == "FAIR" ~ 2,
bldggrade == "AVERAGE" ~ 3,
bldggrade == "GOOD" ~ 4,
bldggrade == "VERY GOOD" ~ 5,
bldggrade == "EXCELLENT" ~ 6,
bldggrade == "CUSTOM" ~ 7),
ac_dummy = ifelse(actype == "AC-NONE", 0, 1),
heat_dummy = ifelse(heatedfuel == "NONE", 0, 1),
mansion= ifelse(heatedarea>5000 & heatedarea<10001 & bedrooms>2 & bedrooms<6 & fullbaths>2, 1, 0),
castle= ifelse(heatedarea>10000 & bedrooms>5 & fullbaths>5, 1, 0),
house= ifelse(mansion==0 & castle==0, 1, 0),
heatfuel = ifelse(heatedfuel == "NONE", NA, heatedfuel)
)

After the primary feature engineering, I removed the original columns that are no longer needed, leaving a cleaner dataset that is easier to work with. The dataset is now ready for model building.
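The column cleanup mentioned above might look like the following; the exact set of columns dropped is illustrative, since the text does not list them:

```r
library(dplyr)

# Drop original columns that have been replaced by engineered features
# (illustrative selection; the actual columns dropped may differ)
house <- house %>%
  dplyr::select(-yearbuilt, -actype, -heatedfuel)
```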
Many factors that affect housing prices are not contained in the given dataset, including several spatial variables. For example, housing values are closely correlated with neighborhood crime rates, median household income, and whether the house is near public transportation, parks, and public spaces. Those variables should be included in the prediction model to improve its performance.
Additional geospatial data was collected from the American Community Survey, the City of Charlotte open data portal, and other sources. It includes crime counts, public transit coverage, tree canopy, and proximity to parks, and was cleaned for use in model building.
Median household income at the census block group level was collected from the American Community Survey.
income<- get_acs(
geography = "block group",
variables = c("income"="B19013_001"),
state = "NC",
county = "Mecklenburg",
geometry = TRUE
)%>%
st_transform(crs =st_crs(house))%>%
filter(!is.na(estimate))%>%
dplyr::select(estimate, GEOID, geometry)%>%
rename(income = estimate)

# spatial join the income data to the house data
house <- st_join(house, income)%>%
filter(!is.na(income))

Crime data was collected from the City of Charlotte open data portal (police incident reports).
The crime data is converted to spatial points and projected to the same coordinate reference system as the housing data.
crime<- crime %>%
dplyr::select(LATITUDE_PUBLIC, LONGITUDE_PUBLIC, YEAR)%>%
st_as_sf(coords = c("LONGITUDE_PUBLIC", "LATITUDE_PUBLIC"), crs = 4326)%>%
st_transform(crs = st_crs(house))

The crime rate is then measured at the census block group level and assigned to the corresponding houses.
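One way to implement the block-group crime measure, assuming the `income` layer still carries the block-group geometries and GEOIDs used earlier:

```r
library(sf)
library(dplyr)

# Count crime incidents per census block group by spatially joining
# the crime points to the block-group polygons
crime_counts <- st_join(crime, income) %>%
  st_drop_geometry() %>%
  count(GEOID, name = "crime")

# Attach the counts to each house through its block-group GEOID
house <- house %>%
  left_join(crime_counts, by = "GEOID") %>%
  filter(!is.na(crime))
```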
Public transportation coverage data was collected from the Charlotte/Mecklenburg Quality of Life Explorer; it measures the percentage of housing units within 0.5 mile of a transit stop.
transit <- st_read("data/transportation.geojson")%>%
dplyr::select(X2023, geometry)%>%
rename(transit = X2023)

# spatial join the transit data to the house data
house <- st_join(house, transit)%>%
filter(!is.na(transit))

Tree canopy data was collected from the Charlotte/Mecklenburg Quality of Life Explorer; it measures the percentage of tree canopy in each neighborhood.
tree <- st_read("data/tree_canopy.geojson")%>%
dplyr::select(X2023, geometry)%>%
rename(canopy = X2023)

Recreation access data was collected from the Charlotte/Mecklenburg Quality of Life Explorer; it measures the percentage of housing units within 0.5 mile of an outdoor public recreation area at the census block group level.
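The canopy and recreation measures can then be attached to each house with spatial joins, assuming both layers share the housing data's CRS and carry `canopy` and `recreation` columns as prepared above:

```r
library(sf)
library(dplyr)

# Attach tree canopy and recreation coverage to each house by location
house <- house %>%
  st_join(tree) %>%
  st_join(recreation) %>%
  filter(!is.na(canopy) & !is.na(recreation))
```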
Summary statistics and a correlation matrix are computed to understand the variables and their relationships. The summary statistics report the mean and standard deviation of each variable, which helps characterize their distributions and flag outliers. The correlation matrix shows the pairwise correlations between variables and helps identify multicollinearity among the predictors.
dependent_var <- "price"
predictors <- c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "income", "crime", "transit", "canopy", "recreation")
summary_stats <- house %>%
st_drop_geometry() %>%
dplyr::select(all_of(c(dependent_var, predictors))) %>%
summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
pivot_wider(names_from = Stat, values_from = Value)
summary_stats <- summary_stats %>%
mutate(Variable = case_when(
Variable == "price" ~ "House Value",
Variable == "bedrooms" ~ "Number of bedrooms",
Variable == "fullbaths" ~ "Number of full baths",
Variable == "halfbaths" ~ "Number of half baths",
Variable == "heatedarea" ~ "House Square footage",
Variable == "storyheigh" ~ "Number of stories",
Variable == "numfirepla" ~ "Number of fireplaces",
Variable == "bldggrade" ~ "Building Grade",
Variable == "units" ~ "Number of units",
Variable == "age" ~ "Age of the house",
Variable == "income" ~ "Median household income",
Variable == "crime" ~ "Number of Crime in Census Block group",
Variable == "transit" ~ "% Transit Coverage",
Variable == "canopy" ~ "% Tree Canopy Coverage",
Variable == "recreation" ~ "% Recreation Coverage",
TRUE ~ Variable
))
summary_stats <- summary_stats %>%
mutate(
Mean = round(Mean, 2),
SD = round(SD, 2)
)
summary_stats <- summary_stats %>%
arrange(Variable == "House Value")
predictor_rows <- which(summary_stats$Variable != "House Value")
dependent_rows <- which(summary_stats$Variable == "House Value")
# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred <- max(predictor_rows)
start_dep <- min(dependent_rows)
end_dep <- max(dependent_rows)
# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
add_header_above(c(" " = 1, "Statistics" = 2)) %>%
kable_styling(full_width = FALSE) %>%
group_rows("Predictors", start_pred, end_pred) %>%
group_rows("Dependent Variable", start_dep, end_dep)%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

| Variable | Mean | SD |
|---|---|---|
| Predictors | | |
| Number of bedrooms | 3.52 | 0.91 |
| Number of full baths | 2.29 | 0.83 |
| Number of half baths | 0.60 | 0.53 |
| House Square footage | 2366.73 | 1057.49 |
| Number of stories | 1.65 | 0.48 |
| Number of fireplaces | 0.79 | 0.47 |
| Building Grade | 3.43 | 0.78 |
| Number of units | 0.98 | 0.19 |
| Age of the house | 28.26 | 24.04 |
| Median household income | 111240.12 | 50146.79 |
| Number of Crime in Census Block group | 156.21 | 144.64 |
| % Transit Coverage | 52.13 | 39.83 |
| % Tree Canopy Coverage | 50.31 | 14.57 |
| % Recreation Coverage | 55.12 | 33.92 |
| Dependent Variable | | |
| House Value | 463583.19 | 374530.87 |
predictor_vars <- house[, c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#a4133c", "white", "#a4133c"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))
Since the heated area and the number of full baths are highly correlated, I removed full baths from the model to avoid multicollinearity. Similarly, the number of bedrooms is highly correlated with heated area, so it was removed as well, and story height was dropped for the same reason. After removing these variables, the final correlation matrix is as follows:
predictor_vars <- house[, c( "halfbaths", "heatedarea", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#a4133c", "white", "#a4133c"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))

To further check for multicollinearity, I computed the Variance Inflation Factor (VIF) for each variable. VIF values above 5 indicate high multicollinearity. All VIF values in the model are below 5, indicating that multicollinearity is not a concern.
model<- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age + ac_dummy + heat_dummy + income + crime + transit + canopy + recreation, data = house)
vif(model)

## halfbaths heatedarea numfirepla bldggrade units age ac_dummy
## 1.199884 2.448252 1.249302 2.023239 1.063245 1.727478 1.176752
## heat_dummy income crime transit canopy recreation
## 1.037838 1.784013 1.604203 1.942375 1.382799 1.364279
longer<-house %>%
pivot_longer(cols = c(
"heatedarea",
"age",
"crime",
"transit"),
names_to = "Variable",
values_to = "Value")%>%
st_drop_geometry()
ggplot(longer,aes(x = Value, y = price)) +
geom_point(color = "black", size= 0.4) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c("heatedarea" = "Heated Area",
"age" = "Age of the House",
"crime" = "Number of Crime",
"transit" = "Transit Coverage"
))) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8)) +
labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
x = "Value",
y = "House price")

map <- st_join(house, income) %>%
group_by(GEOID)%>%
summarise(price=mean(price))%>%
st_drop_geometry()
map<- left_join(income, map, by="GEOID")%>%
filter(!is.na(price))
ggplot(map) +
geom_sf(aes(fill= q5(price)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(map, 'price')), 0),
name = 'Housing price') +
labs(title = "Average House Price by Census Block Group",
fill = "Price") +
theme(
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 10, face = "italic"),
plot.title = element_text(size = 20, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

Q1 <- ggplot(map) +
geom_sf(aes(fill= q5(income)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(map, 'income')), 0),
name = 'Income') +
labs(title = "Average Household Income by Census Block Group",
fill = "Income") +
theme(
legend.text = element_text(size = 4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
Q2<-ggplot(tree) +
geom_sf(aes(fill= q5(canopy)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(tree, 'canopy')), 0),
name = '% of Tree Coverage') +
labs(title = "Tree Canopy by Neighborhood",
fill = "% of Tree Coverage") +
theme(
legend.text = element_text(size =4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
recreation<-recreation%>% filter(!is.na(recreation))
Q3<-ggplot(recreation) +
geom_sf(aes(fill= q5(recreation)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(recreation, 'recreation')), 0),
name = '% of Recreation Coverage') +
labs(title = "Recreation facilities coverage by Neighborhood",
fill = "% of Recreational facilities Coverage") +
theme(
legend.text = element_text(size = 4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
transit<-transit%>% filter(!is.na(transit))
Q4<-ggplot(transit) +
geom_sf(aes(fill= q5(transit)), color='white') +
scale_fill_manual(values = flatreds5,
labels = function(x) round(as.numeric(qBr(transit, 'transit')), 0),
name = '% of Transit Coverage') +
labs(title = "Public Transit coverage by Neighborhood",
fill = "% of Public Transit Coverage") +
theme(
legend.text = element_text(size = 4),
legend.title = element_text(size = 5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 5, face = "italic"),
plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
# Arrange the plots in a 2x2 grid
grid.arrange(Q1, Q2, Q3, Q4, ncol = 2)

For model building, I split the data into 80% training and 20% testing subsets to reliably assess the model's predictive performance.
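The split, fit, and evaluation steps described here can be sketched as follows; the random seed and the exact metric computations are illustrative:

```r
# 80/20 train-test split (seed is illustrative)
set.seed(123)
train_idx <- sample(nrow(house), size = floor(0.8 * nrow(house)))
train <- house[train_idx, ]
test  <- house[-train_idx, ]

# Fit the OLS model on the training set
fit <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade +
            units + age + ac_dummy + heat_dummy + income + crime +
            transit + canopy + recreation, data = train)

# Evaluate on the held-out set, back-transforming predictions to dollars
pred <- exp(predict(fit, newdata = test))
mae  <- mean(abs(test$price - pred))
rmse <- sqrt(mean((test$price - pred)^2))
```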